Introduction

In late 2019, a novel coronavirus, severe acute respiratory synfrome coronavirus 2 (SARS-CoV-2), which cause a severe respiratory disease, emerged in Wuhan, China. The virus spread all over the world, and infected more than 750 million people (WHO 2020). It is reported that patients in different population shows different trend of symptoms. For example, females are less likely to infect SARS-CoV-2 (Steeg 2016). However, mechanisms behind such a difference is still unknown. Lieberman et. al. examined gene expression in response to SARS-CoV-2 infection with shotgun RNA Sequencing to see the gene expression changes(Lieberman 2020). Test condition is 430 SARS-CoV-2 patients and control condition is 54 non-infected individuals.
In the first assignment, RNASeq data of the paper by Lieberman(Lieberman 2020) were obtained from GEO, and its ID is GSE152075. There are 35784 genes in this dataset, and 430 samples from SARS-CoV-2 patients i.e. testing samples and 54 samples from controls. The RNASeq data was mapped to HUGO gene symbols, cleaned e.g. duplicates and low counts were removed, normalized with TMM normalization. MDS plot indicates a good quality of the dataset as positive and negative samples are well-separated. In this assignment, the normalized data is analyzed with differential gene expression analysis, and genes are ranked. Then thresholded over-representation analysis reveals the notably expressed gene.

Setup

if (!requireNamespace("knitr", quietly = TRUE))
  BiocManager::install("knitr")
if (!requireNamespace("edgeR", quietly = TRUE)) 
  BiocManager::install("edgeR")
if (!requireNamespace("scales", quietly = TRUE)) 
  BiocManager::install("scales")
if (!requireNamespace("circlize", quietly = TRUE))
  BiocManager::install("circlize")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
  BiocManager::install("ComplexHeatmap")
if (!requireNamespace("grid", quietly = TRUE))
  BiocManager::install("grid")

Differential Gene Expression

Data Loading

The normalized count data and groups from the first assignment are loaded.

# The normalized count data
normalized_count <- read.table("normalized_counts.txt")
# The beginning of the dataset.
knitr::kable(normalized_count[1:15,1:5], format = "html")
POS_001 POS_002 POS_003 POS_004 POS_005
A1BG 0.000000 1.53141 0.000000 0.000000 19.416692
A1CF 0.000000 0.00000 3.759022 0.000000 0.000000
A2M 265.016689 55.13076 157.878925 153.605438 89.532527
A2ML1 7.681643 0.00000 0.000000 0.000000 3.236115
A4GALT 0.000000 16.84551 3.759022 0.000000 2.157410
AAAS 15.363286 0.00000 0.000000 0.000000 0.000000
AACS 7.681643 50.53653 3.759022 0.000000 29.125039
AADACL2 0.000000 0.00000 7.518044 14.629089 0.000000
AADACP1 7.681643 0.00000 0.000000 3.657272 0.000000
AADAT 0.000000 0.00000 0.000000 10.971817 9.708346
AAGAB 11.522465 41.34807 71.421419 0.000000 38.833385
AAK1 84.498075 82.69614 186.071590 47.544540 161.805771
AAMDC 3.840822 6.12564 5.638533 0.000000 6.472231
AAMP 0.000000 6.12564 26.313154 0.000000 7.550936
AAR2 0.000000 0.00000 11.277066 3.657272 17.259282
# The end of the test samples and the beginning of the controls.
knitr::kable(normalized_count[1:15, 428:432], format = "html")
POS_436 POS_437 POS_438 NEG_001 NEG_002
A1BG 0.000000 12.915179 72.8645754 10.604908 13.0599705
A1CF 0.000000 0.000000 18.2993226 7.574934 0.7255539
A2M 351.104235 40.037055 7.6524440 27.774758 87.7920241
A2ML1 0.000000 10.332143 12.9758833 2.524978 5.8044313
A4GALT 0.000000 40.682814 2.9944346 4.544961 7.2555392
AAAS 0.000000 14.206697 3.6598645 17.674846 30.4732646
AACS 11.703475 43.265850 39.5930798 22.724802 7.2555392
AADACL2 0.000000 18.081251 18.2993226 105.039086 26.1199410
AADACP1 2.925869 10.977902 8.9833038 12.119894 0.7255539
AADAT 0.000000 2.583036 4.9907243 10.099912 0.7255539
AAGAB 32.184555 48.431921 17.3011777 31.309728 21.7666175
AAK1 190.181460 173.709158 164.3611883 176.243466 264.1016261
AAMDC 0.000000 12.915179 14.3067431 1.009991 1.4511078
AAMP 0.000000 7.749107 0.6654299 5.554952 5.0788774
AAR2 0.000000 13.560938 7.3197290 10.604908 9.4322009
# The end of the controls.
knitr::kable(normalized_count[1:15, 480:484], format = "html")
NEG_059 NEG_060 NEG_062 NEG_063 NEG_065
A1BG 1.187264 0.000000 0.000000 0.00000 2.256175
A1CF 0.000000 0.000000 0.000000 0.00000 4.512351
A2M 53.426879 0.000000 11.482416 0.00000 121.833471
A2ML1 3.561792 2.319754 0.000000 56.28336 4.512351
A4GALT 23.745280 4.639508 3.444725 7.18511 0.000000
AAAS 16.621696 15.078401 11.482416 0.00000 2.256175
AACS 22.558016 20.877786 25.261316 19.16029 29.330280
AADACL2 5.936320 1.159877 4.592967 0.00000 135.370523
AADACP1 5.936320 4.639508 4.592967 0.00000 18.049403
AADAT 9.498112 9.279016 9.185933 81.43125 4.512351
AAGAB 40.366976 28.996925 33.299007 0.00000 4.512351
AAK1 54.614143 31.316680 41.336698 79.03621 185.006382
AAMDC 9.498112 2.319754 2.296483 0.00000 11.280877
AAMP 47.490560 42.915450 41.336698 14.37022 0.000000
AAR2 11.872640 18.558032 18.371866 0.00000 6.768526
# The groups defined in the first assignment
samples <- read.table("samples.txt")
# The beginning of the dataset.
knitr::kable(samples[1:5, ], format = "html")
individual SARS.CoV.2
POS_001 1 POS
POS_002 2 POS
POS_003 3 POS
POS_004 4 POS
POS_005 5 POS
# The end of the test samples and the beginning of the controls.
knitr::kable(samples[428:432, ], format = "html")
individual SARS.CoV.2
POS_436 436 POS
POS_437 437 POS
POS_438 438 POS
NEG_001 1 NEG
NEG_002 2 NEG
# The end of the controls.
knitr::kable(samples[480:484, ], format = "html")
individual SARS.CoV.2
NEG_059 59 NEG
NEG_060 60 NEG
NEG_062 62 NEG
NEG_063 63 NEG
NEG_065 65 NEG

Note that some samples are omitted e.g.NEG_061, and further details about patients are not submitted by the author for a privacy reason.

Calculate P-Values

# Create DGEList object.
filtered_data_matrix <- as.matrix(normalized_count)
d <- edgeR::DGEList(counts=filtered_data_matrix, group=samples$cell_type)
# Create a design model.
model_design_pat <- model.matrix(~samples$SARS.CoV.2 + samples$individual)
# Estimate dispersion.
d <- edgeR::estimateDisp(d, model_design_pat)
# Calculate normalization factors.
d <- edgeR::calcNormFactors(d)
# MDS plot.
limma::plotMDS(d, 
               labels=rownames(samples), 
               col = c("darkgreen","blue")[factor(samples$SARS.CoV.2)],
               main = "MDS Plot")

Figure 1: A MDS plot of samples. Blue: SARS-CoV-2 positive patients, green: SARS-CoV-2 negative patients.

There is a clear separation between positive(blue) and negative(green) samples. This indicates that there is not big patient factor. Quasi likelihood test is applied here, because there are 484 samples, which is too much for likelihood ratio test.

# Create DGEList object.
filtered_data_matrix <- as.matrix(normalized_count)
d <- edgeR::DGEList(counts=filtered_data_matrix, group=samples$cell_type)
# Estimate Dispersion.
d <- edgeR::estimateDisp(d, model_design_pat)
# Fit the model.
fit <- edgeR::glmQLFit(d, model_design_pat)
# Obtain differential expression with Quasi likelihood test.
qlf.pos_vs_neg <- edgeR::glmQLFTest(fit, coef="samples$SARS.CoV.2POS")
# Get the result.
qlf_output <- edgeR::topTags(qlf.pos_vs_neg, sort.by = "PValue", n = nrow(normalized_count))
knitr::kable(head(qlf_output), format = "html", digits = 64)
logFC logCPM F PValue FDR
RPLP1 -4.622509 7.266273 251.3605 1.487846e-46 2.233257e-42
RPL13A -3.476013 8.491672 227.3249 5.948578e-43 4.464408e-39
RPL13 -3.210374 7.769872 191.9931 1.934494e-37 9.678919e-34
RPS18 -2.933320 7.886842 178.3334 3.104769e-35 1.165064e-31
RPS8 -3.137795 7.708668 175.7935 8.070958e-35 2.422902e-31
RPS3A -2.641507 8.238098 156.0558 1.529206e-31 3.825563e-28
x
BH
x
samples$SARS.CoV.2POS
x
glm
# The number of genes which pass the threshold (0.01).
length(which(qlf_output$table$PValue < 0.01))
## [1] 4272

There are 4272 genes showed significant differential expression. The threshold is 0.01 here, because the sample size is very large and the p-values tend to be higher.

Multiple Hypothesis Testing

For multiple hypothesis testing, Bonferroni method is not proper in this case, because there is a huge number of genes in this dataset and Bonferroni method tends to be too strict with large hypothesis. Therefore, FDR is used here. Threshold is set 0.01 for the same reason as p-value above.

# The number of genes which pass the threshold (0.01) with FDR.
length(which(qlf_output$table$FDR < 0.01))
## [1] 3132

3132 genes passes the correction with FDR. Correction with Bonferroni method is shown below.

# The number of genes which pass the threshold (0.01) with Bonferroni method.
length(which(qlf_output$table$PValue < 0.01 / length(qlf_output$table$PValue)))
## [1] 862

There are 862 genes passed the correction with Bonferroni method and this is much smaller than the one with FDR. However, there are still too many genes with 0.01 threshold and this causes some errors. Therefore, 0.001 is used as threshold instead in later cases.

# The number of genes which pass the threshold (0.001).
length(which(qlf_output$table$PValue < 0.001))
## [1] 2698
# The number of genes which pass the threshold (0.001) with FDR.
length(which(qlf_output$table$FDR < 0.001))
## [1] 1837

There are 2698 genes with p-value which pass 0.001 threshold, and 1837 genes which pass the correction.

Volcano Plot

# Apply grey for all at first.
color_volcano <- rep("grey", length(qlf_output$table$PValue))
# Red for up-regulated ones.
# Note 0.001, not 0.01 threshold is used here for the reason mentioned above.
color_volcano[qlf_output$table$logFC > 2 & qlf_output$table$FDR < 0.001] <- 'red'
# Blue for down-regulated ones.
# Note 0.001, not 0.01 threshold is used here for the reason mentioned above.
color_volcano[qlf_output$table$logFC < -2 & qlf_output$table$FDR < 0.001] <- 'blue'
# Volcano plot.
plot(qlf_output$table$logFC, 
     -log10(qlf_output$table$PValue), 
     col = scales::alpha(color_volcano, 0.3),
     pch = 16,
     xlab = "log2 fold change",
     ylab = "-log10 p",
     main = "Volcano Plot for SARS-CoV-2 Positive/Negative Patients"
)
legend("topright", 
       legend=c("Upregulated Genes","Downregulated Genes", "Not Significant Genes"), 
       fill = c("red", "blue", "grey"))

Figure 2: A volcano plot for SARS-CoV-2 positive/negative patients. Genes whose log2 fold change is more than 2 and FDR is less than the threshold is defined as up-regulated genes and colored in red. Down-regulated genes are those which has less than -2 logs fold change and less than the threshold FDR, and are shown in blue. Note that the threshold is set 0.001, not 0.01.

Heatmap

# Create a numerical matrix of genes with FDR less than the threshold.
# Note 0.001, not 0.01 threshold is used here for the reason mentioned above.
# Here is the error given when 0.01 threshold is used. This is probably because of too many genes and patients.
# Error in getFromNamespace(device_info[3], ns = device_info[2])(temp_image) : 
#  unable to open /var/folders/cc/myvsc01x5gs6_76g6xsmrb6m0000gn/T//RtmpXOogRM/.heatmap_body_e4228f848277c0ae2290f1a79e9515a1_1_144e32682e9a3.png
tophits <- normalized_count[rownames(normalized_count) %in% 
                                     rownames(qlf_output$table[which(qlf_output$table$FDR < 0.001),]),]
heatmap_matrix <- t(scale(t(tophits)))
# Set a color table.
if(min(heatmap_matrix) == 0){
  heatmap_col <- circlize::colorRamp2(c(0, max(heatmap_matrix)), c( "white", "red"))
}else{
  heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
# Build a heatmap.
heatmap_tophits <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix),
                                   cluster_rows = TRUE,
                                   cluster_columns = TRUE,
                                   show_row_dend = TRUE,
                                   show_column_dend = TRUE,
                                   col = heatmap_col,
                                   show_column_names = TRUE,
                                   show_row_names = FALSE,
                                   show_heatmap_legend = TRUE,
                                   column_names_gp = grid::gpar(fontsize = 2)
                                   )
ComplexHeatmap::draw(heatmap_tophits, column_title = "Heatmap for SARS-CoV-2 Positive/Negative Patients")

Figure 3. Heatmap for genes with significant differential expression. SARS-CoV-2 negative patients are clustered on the left, although the font size of column names are so small that they can be hardly read. Note that the threshold is set 0.001, not 0.01.

In the bottom, there is a cluster of up-regulated genes in SARS-CoV-2 negative patients and such genes are downregulated in SARS-CoV-2 positive patients. Genes at the top tend to be more down-regulated in SARS-CoV-2 negative patients, but genes in SARS-CoV-2 positive patients are not up-regulated. This is probably because most patients are SARS-CoV-2 positive (430 out of 484) and genes that are up-regulated specifically in positive patients cannot give small enough FDR.

Thresholded Over-Representation Analysis

Method

Thresholded over-representation analysis is performed with g:Profiler. This is because g:Profiler was used in the lecture and is very handy for this assignment, especially with the web interface.

Annotation Data

GO biological process, Reactome, and WikiPathways are used here. This is because these datasets are for human. They were released in 2024-01-17, 2024-01-25, 2024-01-01 respectively.

Genesets

Up-/down-regulated genes are defined with the threshold of 0.001.

# Extract up-regulated genes.
# Note that the threshold is set 0.001, not 0.01.
up_genes <- rownames(qlf_output)[qlf_output$table$logFC > 0 & qlf_output$table$PValue < 0.001]
length(up_genes)
## [1] 929
# Write into text files.
write.table(up_genes, 
           "upregulated_genes.txt",
           sep = "\t",
           row.names = FALSE,
           col.names = FALSE,
           quote = FALSE
           )
# Extract up-regulated genes.
# Note that the threshold is set 0.001, not 0.01.
down_genes <- rownames(qlf_output)[qlf_output$table$logFC < 0 & qlf_output$table$PValue < 0.001]
length(down_genes)
## [1] 1769
# Write into text files.
write.table(down_genes, 
           "downregulated_genes.txt",
           sep = "\t",
           row.names = FALSE,
           col.names = FALSE,
           quote = FALSE
           )
# Extract genes for whole list.
# Note that the threshold is set 0.001, not 0.01.
whole_genes <- rownames(qlf_output)[qlf_output$table$PValue < 0.001]
# Write into text files.
write.table(whole_genes, 
           "whole_genes.txt",
           sep = "\t",
           row.names = FALSE,
           col.names = FALSE,
           quote = FALSE
           )

There are 929 up-regulated genes and rlength(down_genes) down-regulated genes.

Run the thresholded over-representation analysis

The thresholded over-representation analysis is performed with the web interface (https://biit.cs.ut.ee/gprofiler/gost). BH FDR is used for significance threshold and the threshold is 0.001. GO:BP result for up-regulated genes

Figure 4: g:Profiler result of up-regulated genes from GO biological process. Top 10 results are shown here.

Reactome result for up-regulated genes
Reactome result for up-regulated genes

Figure 5: g:Profiler result of up-regulated genes from Reactome. Top 10 results are shown here. Pathways shown in grey indicates that they are above the threshold.

WikiPathways result for up-regulated genes
WikiPathways result for up-regulated genes

Figure 6: g:Profiler result of up-regulated genes from WikiPathways. Top 10 results are shown here.

Almost all the pathways for up-regulated genes are related to immune system. Note that among such immune systems, pathways involved in a response to viruses are dominant. This looks reasonable as a up-regulated genes for SARS-CoV-2 positive patients.

GO:BP result for down-regulated genes
GO:BP result for down-regulated genes

Figure 7: g:Profiler result of down-regulated genes from GO biological process. Top 10 results are shown here.

Reactome result for down-regulated genes
Reactome result for down-regulated genes

Figure 8: g:Profiler result of down-regulated genes from Reactome. Top 10 results are shown here.

WikiPathways result for down-regulated genes
WikiPathways result for down-regulated genes

Figure 9: g:Profiler result of down-regulated genes from WikiPathways. Top 10 results are shown here. Pathways shown in grey indicates that they are above the threshold.

For down-regulated genes, there are many pathways for translation. I found it interesting that there are pathways for respiration system, although the reason is unclear. GO:BP result for whole list

Figure 10: g:Profiler result of whole list from GO biological process. Top 10 results are shown here.

Reactome result for whole list
Reactome result for whole list

Figure 11: g:Profiler result of whole list from Reactome. Top 10 results are shown here.

WikiPathways result for whole list
WikiPathways result for whole list

Figure 12: g:Profiler result of whole list from WikiPathways. Top 10 results are shown here. Pathways shown in grey indicates that they are above the threshold.

For the whole list, these are apparently a combination of up-/down-regulated reason and not difference found.

Interpretation

Comparison to the Original Paper

The result mostly supports the original paper. Lieberman et.al. reported that up-regulation of anti-viral factors and down-regulation of genes related to ribosomes. However, they also state that reduction in transcription of some interferon-related genes. There is no pathways about interferon in g:Profiler result for down-regulated genes, but there is in the result for up-regulated genes. This might be because of different annotation sources form the original paper. In addition, g:Profiler result does not directly contradict to the original paper because increased expression of interferon response genes are also reported in the original paper.(Lieberman 2020)

Publications to Support the Result

Chen et. al. report that transcription of ACE2, which is a interferon-responsive genes, significantly increase in Asian females, moderately decrease in elderly people in all ethnic groups, and highly significant decrease in type II diabetic patients(Chen et al. 2020). Therefore, patients variation like ethnicity, sex or health condition may have a huge impact on the interferon related gene transcription but the impact is not evident in this report because details about patients are not released in Lieberman et.al. paper for privacy reason. (Lieberman 2020).

Reference

Chen, Jiawei, Quanlong Jiang, Xian Xia, Kangping Liu, Zhengqing Yu, Wanyu Tao, Wenxuan Gong, and Jing-Dong J. Han. 2020. “Individual Variation of the SARS-CoV-2 Receptor ACE2 Gene Expression and Regulation.” Aging Cell 19 (7): e13168. https://doi.org/https://doi.org/10.1111/acel.13168.
Lieberman, Vikas AND Xie, Nicole A. P. AND Peddu. 2020. “In Vivo Antiviral Host Transcriptional Response to SARS-CoV-2 by Viral Load, Sex, and Age.” PLOS Biology 18 (9): 1–17. https://doi.org/10.1371/journal.pbio.3000849.
Steeg, Sabra L. vom, Landon G. AND Klein. 2016. “SeXX Matters in Infectious Disease Pathogenesis.” PLOS Pathogens 12 (2): 1–6. https://doi.org/10.1371/journal.ppat.1005374.
WHO. 2020. “WHO Coronavirus (COVID-19) Dashboard.” https://covid19.who.int.